David Williams-King

type: Person · mentions: 1

// recent coverage: 1 mention

2026-05-28 10:50 · lesswrong.com · ai-safety

What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

Researchers found that alignment faking—where AI models strategically comply with training to preserve their original preferences—occurs in many open-weight models, not just Claude 3 Opus as previousl…

// co-occurs with: top 7 entities (mention count)

Claude 3 Opus (1) · Sheshadri et al. (1) · OLMo-3.1-32B (1) · Gemma-3-27B (1) · Llama-3.3-70B (1) · Alan Cooney (1) · ERA fellowship (1)